This book is a first draft, and I am actively collecting feedback to shape the final version. Let me know if you spot typos, errors in the code, or unclear explanations, your input would be greatly appreciated. And your suggestions will help make this book more accurate, readable, and useful for others. You can reach me at: Email:contervalconsult@gmail.com LinkedIn:www.linkedin.com/in/jorammutenge Datasets:Download all datasets
Knowledge is potential power. Knowledge used is power.
— Joram Mutenge
In the previous chapters, we explored how powerful and versatile Polars can be for data analysis. In this chapter, we bring those ideas together by applying what you have learned to answer ten questions based on the library_checkouts dataset. Learning a tool has limited value if you never put it into practice. Application is where understanding happens.
I strongly encourage you to attempt each question before reviewing the solutions. If you find yourself stuck, return to earlier chapters to refresh your memory. Revisiting concepts in context will help reinforce what you have learned and deepen your confidence.
8.1 The Dataset
The library_checkouts dataset contains records of items checked out by patrons at public libraries across Brisbane, Australia. Library names and item types are stored as acronyms. To make the results easier to interpret, we will introduce a second dataset, library_acronyms, which maps these acronyms to their full names. Working with full names allows us to more easily recognize libraries and items throughout the analysis.
Assign simpler, more descriptive names to the selected columns.
Next, we join the two acronym dataframes, library_acronyms and item_acronyms, with the main dataset. This produces the final dataframe that we will use to answer all ten questions.
Exclude columns names ending with the text “Full”.
shape: (3, 8)
Title
Author
ID
Item
Lang
Age
Library
Date
str
str
str
str
str
str
str
datetime[ms, Australia/Brisbane]
"Scooby-Doo, where are you?."
null
"34000104954037"
"Junior Magazines"
null
"JUVENILE"
"Coopers Plains"
2020-02-06 19:59:33 AEST
"Scooby-Doo, where are you?."
null
"34000107294902"
"Junior Magazines"
null
"JUVENILE"
"Coopers Plains"
2020-02-06 19:59:33 AEST
"Scooby-Doo's creepiest capers"
null
"34000106761927"
"DVD"
null
"JUVENILE"
"Toowong"
2020-02-06 20:14:31 AEST
8.2 The Ten Questions
The questions in this section draw on a wide range of concepts covered throughout the book. They are intentionally challenging. Do not be discouraged if you cannot answer all of them on your first attempt. Their purpose is to strengthen your understanding of Polars through practice.
For each question, I have included the expected output to give you a sense of what your solution should produce. Once again, try to answer the questions on your own before jumping directly to the solutions.
What is the title of the item checked out in October 2020 at 23:46:47?
What are the five most checked-out adult fiction books in 2025?
┌───────────────────────────────────┐
│ Title │
│ --- │
│ str │
╞═══════════════════════════════════╡
│ Traitors gate │
│ Tell me everything : a novel │
│ Mr Einstein's secretary : an epic │
│ Turn a blind eye │
│ Exiles │
└───────────────────────────────────┘
Create a multi line chart showing the 7-day average and median checkouts for Toowong library.
Use your local time zone and show the day and time when one of my favorite books, This Side of Paradise, was checked out. Note: The values will reflect your local time zone, but the shape of the dataframe should remain the same.
┌───────────────────────┬───────────────────────────────┬───────────┬──────────┐
│ Title ┆ Date ┆ Day ┆ Time │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ datetime[ms, America/Chicago] ┆ str ┆ time │
╞═══════════════════════╪═══════════════════════════════╪═══════════╪══════════╡
│ This side of paradise ┆ 2022-08-08 07:57:21 CDT ┆ Monday ┆ 07:57:21 │
│ This side of paradise ┆ 2020-01-06 03:50:08 CST ┆ Monday ┆ 03:50:08 │
│ This side of paradise ┆ 2020-03-06 05:20:32 CST ┆ Friday ┆ 05:20:32 │
│ This side of paradise ┆ 2020-06-08 06:30:04 CDT ┆ Monday ┆ 06:30:04 │
│ This side of paradise ┆ 2022-06-07 07:54:41 CDT ┆ Tuesday ┆ 07:54:41 │
│ This side of paradise ┆ 2023-12-08 04:44:55 CST ┆ Friday ┆ 04:44:55 │
│ This side of paradise ┆ 2024-05-08 21:35:33 CDT ┆ Wednesday ┆ 21:35:33 │
└───────────────────────┴───────────────────────────────┴───────────┴──────────┘
Create a heatmap showing the frequency of checkouts, with days on the y-axis and months on the x-axis.
Create a single column dataframe that shows a numbered list of Cal Newport books. The first row should be the full author name and book titles should be in the subsequent rows. Tip: this involves using intermediate variables.
┌───────────────────────────────────────────────────────────────────────────────────┐
│ Author │
│ --- │
│ str │
╞═══════════════════════════════════════════════════════════════════════════════════╡
│ Cal Newport │
│ 1. A world without email : reimagining work in the age of overload │
│ 2. Deep work : rules for focused success in a distracted world │
│ 3. Digital minimalism : on living better with less technology │
│ 4. Slow productivity : the lost art of accomplishment without burnout │
│ 5. So good they can't ignore you : why skills trump passion in the quest for wor… │
└───────────────────────────────────────────────────────────────────────────────────┘
Create a mult-year line chart showing the biography checkouts over the years.
8.3 Solutions to the Questions
In this section, I present solution code for the ten questions asked in the previous section. I hope you took some time to work through them on your own. If not, that is perfectly fine as well.
It is important to note that your solution code does not need to match mine exactly. One of the advantages of Polars is that it allows you to express the same logic in multiple ways while producing identical results. As long as your output matches the expected result, your approach is valid. In fact, your solution may feel more intuitive since it reflects your own reasoning. At the very least, reviewing my solutions may offer a different perspective on how each question can be approached.
8.3.1 Question One
What is the title of the item checked out in October 2020 at 23:46:47?
The key idea behind this question is that it requires filtering the dataframe using datetime components. In particular, it involves filtering on specific parts of a timestamp. The most challenging aspect is applying filter correctly when working with time values.
The colons : in the time value are replaced with commas , when using pl.time.
2
The returned Title value includes a trailing period .. This line removes that character.
shape: (1, 1)
Title
str
"Downton Abbey"
8.3.2 Question Two
What proportion of DVD checkouts between Banyo and Hamilton does each library account for?
The most subtle part of this question is remembering that filtering by equating the Item value to “DVD” does not capture all DVD-related items. When filtering text values, it is generally safer to use contains rather than == or eq. The following examples illustrate why this matters. First, consider the items that are exactly equal to “DVD”.
A common mistake is to assume that percentage calculations should be based on the entire dataframe. Here, percentages are calculated only after filtering for the target libraries, Banyo and Hamilton.
shape: (2, 2)
Library
Pct_Checkout
str
f64
"Banyo"
55.71
"Hamilton"
44.29
8.3.3 Question Three
Create a bar chart showing the five most popular checked-out items.
Some rows in the Item column contain null values. Since null values should not be counted among the most popular items, they must be filtered out before aggregation.
from bokeh.models import NumeralTickFormatter(library_checkouts.filter(pl.col('Item').is_not_null()).group_by('Item').len()1.top_k(5, by='len').hvplot.barh(x='Item', y='len', xlabel='', ylabel='').opts(title='Top 5 most popular items checked-out',2 xformatter=NumeralTickFormatter(format='0a')))
1
This retrieves the top values directly without requiring an explicit sort of the dataset.
2
This ensures that numeric labels are not displayed in scientific notation, with k representing thousands and m representing millions.
8.3.4 Question Four
Who is the most popular non-fiction adult author in each of the top five libraries by checkouts?
This question requires multiple filters and, most importantly, the removal of null values in the Author column.
Sorting at this stage is essential to ensure that the most popular authors are selected for each library in the subsequent group_by. Without sorting, there is no guarantee that the top results are returned.
2
Author names are recorded in last-name-first format, such as “Steves, Rick”. This line reverses the order and removes the comma so the names appear in a more natural format.
shape: (5, 2)
Library
Author
str
str
"Ashgrove"
"Rick Steves"
"Chermside"
"Patrick. Holford"
"Grange"
"Jamie Oliver"
"Indooroopilly"
"Noel Whittaker"
"Wynnum"
"Jamie Oliver"
8.3.5 Question Five
What are the five most checked-out adult fiction books in 2025?
This question focuses only on the titles of the most checked-out books. The number of checkouts for each title does not need to be displayed.
The dataset does not include an item labeled simply “Fiction”. Instead, values such as “Adult Fiction” and “Junior Fiction” are used, which makes contains necessary. However, contains also matches “Non Fiction”, so the ne filter removes those rows.
2
This removes the trailing space and forward slash / from the end of the book titles.
shape: (5, 1)
Title
str
"Traitors gate"
"Tell me everything : a novel"
"Mr Einstein's secretary : an e…
"Turn a blind eye"
"Hidden in plain sight"
8.3.6 Question Six
Create a multi line chart showing the 7-day average and median checkouts for Toowong library.
Before computing rolling statistics, it is essential to filter the data for the Toowong library. If this step is skipped, the rolling average and median will be calculated across all libraries, producing incorrect results.
(library_checkouts .filter(pl.col('Library') =='Toowong') .sort('Date')1 .group_by_dynamic('Date', every='1d') .agg(Checkouts=pl.count('ID'))2 .with_columns(Average=pl.col('Checkouts').rolling_mean(window_size=7, min_samples=1), Median=pl.col('Checkouts').rolling_median(window_size=7, min_samples=1)) .hvplot.line(x='Date', y=['Average','Median'], legend='top_right', legend_opts={'title':''}, legend_cols=2, xlabel='', ylabel='Checkouts') .opts(title='Toowong 7-day average and median checkouts'))
1
Since the rolling calculations are based on days, the datetime granularity can be reduced from hours to days.
2
Setting min_samples=1 ensures that the calculation is performed as long as there is at least one non-null value in the window.
8.3.7 Question Seven
Use your local time zone and show the day and time when one of my favorite books, This Side of Paradise, was checked out.
When most people hear the name F. Scott Fitzgerald, they immediately think of The Great Gatsby. However, I would argue that his lesser-known novel, This Side of Paradise, is his best work. If you have not read it yet, give it a try. I will be waiting for your thank you email.
(library_checkouts1 .filter(pl.col('Title').str.contains(r'(?i)this side of paradise')) .with_columns(pl.col('Title').str.strip_chars('/ '))2 .with_columns(pl.col('Date').dt.convert_time_zone('America/Chicago'))3 .with_columns(Day=pl.col('Date').dt.strftime('%A'), Time=pl.col('Date').dt.time()) .select('Title','Date','Day','Time') )
1
Book titles often appear in datasets with inconsistent capitalization. Unless you know the exact casing, it is best to use a case-insensitive filter.
2
My local time zone is “America/Chicago”. If your time zone differs, your datetime values will also differ, but the structure of the dataframe will remain the same.
3
The format string %A returns the full name of the weekday.
shape: (7, 4)
Title
Date
Day
Time
str
datetime[ms, America/Chicago]
str
time
"This side of paradise"
2022-08-08 07:57:21 CDT
"Monday"
07:57:21
"This side of paradise"
2020-01-06 03:50:08 CST
"Monday"
03:50:08
"This side of paradise"
2020-03-06 05:20:32 CST
"Friday"
05:20:32
"This side of paradise"
2020-06-08 06:30:04 CDT
"Monday"
06:30:04
"This side of paradise"
2022-06-07 07:54:41 CDT
"Tuesday"
07:54:41
"This side of paradise"
2023-12-08 04:44:55 CST
"Friday"
04:44:55
"This side of paradise"
2024-05-08 21:35:33 CDT
"Wednesday"
21:35:33
8.3.8 Question Eight
Create a heatmap showing the frequency of checkouts, with days on the y-axis and months on the x-axis.
A heatmap uses color intensity to represent the magnitude of values in a two-dimensional matrix, transforming numerical data into an intuitive visual form. Heatmaps are especially useful for quickly identifying patterns, trends, and areas of high or low activity. In the heatmap below, darker colors indicate a higher number of checkouts for a given day and month combination.
Numerical representations of months and days are necessary for correct sorting. Sorting by month names alone would place April before January. The underscore prefix makes these helper columns easy to identify and remove later.
2
Sorting is performed using the numerical day and month values.
3
This removes all columns whose names start with an underscore.
8.3.9 Question Nine
Create a single column dataframe that shows a numbered list of Cal Newport books. The first row should be the full author name and book titles should be in the subsequent rows.
This question relies on intermediate variables. First, we extract all books written by Cal Newport and store them in a variable named cn_books.
Omitting this line would introduce a null value into the result set.
2
Some titles appear to be unique only because they include the text “FASTBACK”. Removing this text ensures that identical titles are treated as the same book.
3
This reverses the author name so it appears as first name followed by last name.
4
One title includes a colon before the subtitle while another does not. This standardization ensures both are treated as the same book.
The years 2021 and 2022 contain null values across the entire Item column, not just for biography records. This suggests that an error occurred during data collection for those years.
8.3.11 Bonus Question
Here is a bonus question to further test your Polars mastery. I will show you the output, but I will not provide the solution code. Instead, I am sharing my email address, www.contervalconsult@gmail.com so you can send me your solution.
I promise to reply with my own solution, especially if it differs from yours. This way, you can compare approaches and learn an alternative way to solve the problem.
Return the book title(s) whose number of digits is exactly half of the maximum digit count found in any title, along with the digit count. Exclude titles that consist only of digits.
The ten questions we explored were intentionally challenging, but they demonstrate that Polars is well suited for performing deep and expressive data analysis. Throughout these examples, we filtered data using datetime components, worked with text inconsistencies, computed rolling statistics, reshaped data for visualization, and combined multiple transformations into clear analytical pipelines.
One of Polars’ greatest strengths is its flexibility. The same problem can often be solved in several valid ways, each producing the same result. As you continue working with Polars, focus on identifying an approach that aligns with your own reasoning and feels natural to you. Avoid feeling overwhelmed by alternative solutions, especially early on.
The most important goal is to write code that produces the correct answer in a clear and reliable way. Once you are confident in the correctness of your results, you can revisit your solution to improve readability, performance, or expressiveness. By building this habit, you will develop both confidence and fluency in Polars, allowing you to tackle increasingly complex analytical questions with ease.